Introduction to Open Data Science - Course Project

About the project


The IODS course teaches how to use R, RStudio, R Markdown, and GitHub to apply statistical methods of data science. The link to my GitHub repository: https://github.com/yingjchen/IODS-project.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec  8 22:10:59 2023"

The text continues here.


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Fri Dec  8 22:10:59 2023"

Here we go again…

Load the students2014 data

df <- read.csv('./data/learning2014.txt', sep = ',', stringsAsFactors = FALSE, header = TRUE)
dim(df)
## [1] 166   7
head(df)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(df)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The students2014 data contains 166 observations and 7 variables (gender, age, attitude, deep, stra, surf, points).

A graphical overview of the data

par(mfrow = c(2,4))
barplot(table(df$gender), main = 'gender')
hist(df$age)
hist(df$attitude)
hist(df$deep)
hist(df$stra)
hist(df$surf)
hist(df$points)

pairs(df[, -1])

Fit a regression model

lm.1 <- lm(points ~ attitude + deep + stra, data = df)
summary(lm.1)
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

Remove the two variables (deep and stra) which did not have a statistically significant relationship with the target variable (exam points), and fit the regression model again with the remaining explanatory variable.

lm.2 <- lm(points ~ attitude, data = df)
summary(lm.2)
## 
## Call:
## lm(formula = points ~ attitude, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Explanation of the fitted model

R-squared tells how large a proportion of the variance in Y the fitted model explains, compared to the total variance in Y.

summary(lm.2)$r.squared
## [1] 0.1905537

In this case, the linear model on attitude explains around 19% of the variation in exam points, and therefore leaves about 81% of the variation in exam points as something that cannot be explained by (a linear effect of) attitude alone.
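To make the definition concrete, R-squared can be reproduced by hand from the residual and total sums of squares; a minimal base-R sketch using the built-in cars data as a stand-in (not the learning2014 data):

```r
# R^2 = 1 - RSS/TSS: the proportion of variance in Y explained by the model
fit <- lm(dist ~ speed, data = cars)          # built-in data, not the course data
rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_manual <- 1 - rss / tss
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```

For a linear model with an intercept, this identity holds exactly, which is why `summary(lm.2)$r.squared` can be read directly as an explained-variance proportion.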

Produce the following diagnostic plots

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2)) 
plot(lm.2)

  1. Residuals vs. fitted values should not show a pattern where the distribution of residuals varies depending on the fitted values. Thus, it is good if the red line showing the mean of the residuals stays close to a horizontal line.

  2. The points in the QQ-plot should ideally lie on the diagonal line, in which case the residuals are Normally distributed and the SEs and P-values of the model coefficients are reliable. In this case, the QQ-plot deviates somewhat from the diagonal line.

  3. Residuals vs. leverage should not show points that reach outside the curves marking a Cook's distance of 1. Otherwise, such points would have a high influence on the slope of the linear model, and hence the lm() estimates from such data would not be reliable.
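Cook's distances can also be inspected numerically instead of reading them off the plot; a small sketch, again using the built-in cars data as a stand-in:

```r
# numeric check of influential points via Cook's distance
fit <- lm(dist ~ speed, data = cars)   # stand-in model, not the course model
cd  <- cooks.distance(fit)
max(cd)                                # well below 1 here: no point dominates the fit
which(cd > 4 / length(cd))             # a common, stricter rule-of-thumb cutoff
```

The 4/n cutoff is only a screening heuristic; a distance approaching 1 is the serious warning sign mentioned above.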


Chapter 3: Logistic regression

date()
## [1] "Fri Dec  8 22:10:59 2023"

Read data into R

alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The alc data describes student achievement in secondary education at two Portuguese schools, containing 370 students and 35 variables.

Choose 4 interesting variables

Choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption:

I am interested in exploring the relationships between high/low alcohol consumption and 4 variables (‘age’, ‘Pstatus’, ‘studytime’, ‘goout’).

I hypothesize that age and going out are positively correlated with the likelihood of high alcohol consumption, while study time is expected to be negatively correlated with it. I also expect that students whose parents live apart consume more alcohol.

Note:

  - age: student's age (numeric: from 15 to 22)
  - Pstatus: parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
  - studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  - goout: going out with friends (numeric: from 1 - very low to 5 - very high)

Numerical and graphical explorations

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots).

par(mfrow = c(2,2))
hist(alc$age, main = 'age')
barplot(table(alc$Pstatus), main = 'Pstatus')
barplot(table(alc$studytime), main = 'studytime')
barplot(table(alc$goout), main = 'goout')

Most students are 15-18 years old. Most parents are living together.

library(ggplot2)
#alc$studytime = as.factor(alc$studytime)
#alc$goout = as.factor(alc$goout)
#alc$age = as.factor(alc$age)
alc$high_use = as.factor(alc$high_use)
g1 <- ggplot(alc, aes(x = Pstatus, y = alc_use))
g1 + geom_boxplot() + 
   geom_jitter(aes(color=high_use), size=0.4, alpha=0.9) + 
  ylab("alcohol consumption") + theme_classic()

g2 <- ggplot(alc, aes(x = high_use, y = studytime))
g2 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("studytime") +theme_classic()

g3 <- ggplot(alc, aes(x = high_use, y = goout))
g3 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("goout") + theme_classic()

g4 <- ggplot(alc, aes(x = high_use, y = age))
g4 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("age") +theme_classic()

#install.packages('crosstable')
#library(crosstable)
#crosstable(alc, c(age, studytime), by=high_use) 

Logistic regression

glm.1 <- glm(high_use ~ age + studytime + Pstatus + goout, data = alc, family = "binomial")
summary(glm.1)
## 
## Call:
## glm(formula = high_use ~ age + studytime + Pstatus + goout, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.8778     1.8521  -2.094 0.036288 *  
## age           0.1185     0.1094   1.083 0.279024    
## studytime    -0.6276     0.1671  -3.757 0.000172 ***
## PstatusT     -0.1064     0.4070  -0.261 0.793752    
## goout         0.7256     0.1183   6.132  8.7e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 386.18  on 365  degrees of freedom
## AIC: 396.18
## 
## Number of Fisher Scoring iterations: 4
#remove variables with p-values larger than 0.05, and fit the LR model again
glm.2 <- glm(high_use ~ studytime + goout, data = alc, family = "binomial")
summary(glm.2)
## 
## Call:
## glm(formula = high_use ~ studytime + goout, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0686     0.5116  -4.044 5.26e-05 ***
## studytime    -0.6224     0.1660  -3.750 0.000177 ***
## goout         0.7427     0.1175   6.322 2.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 387.41  on 367  degrees of freedom
## AIC: 393.41
## 
## Number of Fisher Scoring iterations: 4
pred.2 = predict(glm.2, newdata = alc[, c( 'studytime' ,'goout')], type = 'response')
pred.2.label = ifelse(pred.2 >= 0.5, TRUE, FALSE)
table(pred.2.label)
## pred.2.label
## FALSE  TRUE 
##   308    62
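The coefficients of glm.2 are on the log-odds scale; exponentiating them gives odds ratios. A quick sketch using the estimates printed in the summary above:

```r
# odds ratios: exp(beta) multiplies the odds of high_use per one-unit increase
or <- exp(c(studytime = -0.6224, goout = 0.7427))   # estimates copied from summary(glm.2)
or
# studytime < 1: each extra study-time category lowers the odds of high use
# goout     > 1: going out more raises the odds of high use
```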

2x2 cross tabulation of predictions versus the actual values

table( truelabels = alc$high_use, predictions = pred.2.label)
##           predictions
## truelabels FALSE TRUE
##      FALSE   238   21
##      TRUE     70   41
mosaicplot(table( truelabels = alc$high_use, predictions = pred.2.label))

Compute the inaccuracy

Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results.

inaccuracy = (21+70)/nrow(alc)
inaccuracy
## [1] 0.2459459

Compared with the performance achieved by a simple guessing strategy (inaccuracy = 0.5), the logistic regression model performs better.
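The same threshold-and-tabulate computation can be checked on simulated data where the true relationship is known; a sketch in which x, y, and the slope 1.5 are all made up for illustration (nothing here comes from the alc data):

```r
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1.5 * x))       # simulate labels from a known model
m <- glm(y ~ x, family = "binomial")
p <- predict(m, type = "response")         # fitted probabilities
pred <- p >= 0.5                           # threshold at 0.5, as above
table(truth = y, prediction = pred)        # 2x2 confusion table
training_error <- mean(pred != (y == 1))   # proportion misclassified
training_error                             # clearly better than 0.5 guessing
```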


Chapter 4: Clustering and classification

date()
## [1] "Fri Dec  8 22:11:01 2023"

Load data

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506  14

A graphical overview of the data

The Boston data contains 506 observations and 14 numeric variables.

pairs(Boston)

library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# for continuous variables
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 

# visualize the correlation matrix
corrplot(cor_matrix, method="circle")

Scale the whole dataset and prepare the training set and test set

# center and standardize variables
boston_scaled <- as.data.frame(scale(Boston))
# class of the boston_scaled objects
class(boston_scaled)
## [1] "data.frame"
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...

The scaled Boston data contains 506 observations and 14 variables.

#create a factor variable
boston_scaled <- as.data.frame(boston_scaled)
#boston_scaled$crim <- as.numeric(boston_scaled$crim)
#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
# divide the dataset into train and test sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]
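Note that sample() draws a different split on every knit, so the LDA results below change each time the document is rendered; fixing the RNG seed makes the split reproducible. A sketch on the built-in iris data (the seed value 2023 is an arbitrary choice):

```r
set.seed(2023)                        # any fixed seed makes the split repeatable
n   <- nrow(iris)                     # 150 rows in the stand-in data
ind <- sample(n, size = round(n * 0.8))
train <- iris[ind, ]
test  <- iris[-ind, ]
c(nrow(train), nrow(test))            # 120 30
```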

Fit the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2549505 0.2450495 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.7854515 -0.8977016 -0.14929469 -0.8523828  0.39401939 -0.8503861
## med_low  -0.1408928 -0.3141000 -0.04298342 -0.5483428 -0.14697258 -0.2790003
## med_high -0.3642391  0.1927470  0.16512651  0.4299369  0.05003221  0.3866833
## high     -0.4872402  1.0170298 -0.08661679  1.0445566 -0.48071276  0.7974988
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8358908 -0.6887728 -0.7246034 -0.3927698  0.38259990 -0.75253470
## med_low   0.2850585 -0.5514748 -0.4733725 -0.1032334  0.32079610 -0.10431333
## med_high -0.3918346 -0.3844363 -0.2877691 -0.3261231  0.06862002  0.08673234
## high     -0.8382711  1.6390172  1.5146914  0.7818116 -0.75762774  0.87829194
##                  medv
## low       0.449021087
## med_low  -0.001666894
## med_high  0.132070749
## high     -0.666101252
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.107122255  0.711496163 -1.06303332
## indus    0.005706519 -0.351111904  0.12931212
## chas    -0.096786117 -0.120677663  0.05449836
## nox      0.382443245 -0.706634934 -1.38997070
## rm      -0.096251501 -0.098382339 -0.19130057
## age      0.254386861 -0.249550217  0.04805445
## dis     -0.121899310 -0.238730774  0.13415876
## rad      3.089745900  0.864823049 -0.19110509
## tax     -0.064689723  0.033550886  0.75640036
## ptratio  0.106045707  0.099821528 -0.35112093
## black   -0.122405329 -0.003069977  0.10430904
## lstat    0.228842668 -0.285440158  0.26813602
## medv     0.216554203 -0.403182346 -0.22117231
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9504 0.0369 0.0127
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Predict the classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       22       9        0    0
##   med_low    8      11        4    0
##   med_high   0      12       15    0
##   high       0       0        0   21

Based on the cross-tabulation, the LDA model predicted the 'high' class perfectly (21/21) and did reasonably well on 'low', while the middle classes ('med_low' and 'med_high') were often confused with their neighbouring categories.

k-means algorithm

library(MASS)
data("Boston")
boston_scaled.2 <- as.data.frame(scale(Boston))
class(Boston)
## [1] "data.frame"
dist_eu <- dist(boston_scaled.2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man<- dist(boston_scaled.2, method = 'manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
library(ggplot2)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares for each number of clusters (on the scaled data)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled.2, k)$tot.withinss})
# visualize the results
plot(x = 1:k_max, y = twcss, type = "l" )
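The "elbow" logic behind this plot shows up clearly on simulated data with a known number of clusters; a small sketch with two artificial groups (nothing from the Boston data):

```r
set.seed(42)
toy <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 5), ncol = 2)))  # two well-separated groups
# total within-cluster sum of squares for k = 1..5
twcss_toy <- sapply(1:5, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)
diff(twcss_toy)   # by far the largest drop is from k = 1 to k = 2
```

The elbow sits where the drop in total WCSS levels off, here at the true number of clusters, k = 2.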

# k-means clustering on the scaled data
# choose the number of clusters as 3
km <- kmeans(boston_scaled.2, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Fri Dec  8 22:11:03 2023"

Load data

library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(human)
## [1] 155   9

Move the country names to rownames

library(tibble)
human_ <- column_to_rownames(human, "Country")
summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#graphical overview of the data 
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(human_, progress = FALSE)

There are strong positive correlations between Ado.Birth and Mat.Mor, and between Edu.Exp and Edu2.FM.

There are strong negative correlations between Life.Exp and Mat.Mor, Edu.Exp and Mat.Mor, Edu2.FM and Mat.Mor, Ado.Birth and Life.Exp, and Ado.Birth and Edu.Exp.

PCA

pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2, xlabs=rep(".", nrow(human_)))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


Standardize the variables and repeat the PCA

human_scale <- scale(human_)
pca_human_scale <- prcomp(human_scale)
biplot(pca_human_scale, choices = 1:2, xlabs=rep(".", nrow(human_scale)))

s <- summary(pca_human)
pca_pr <- round(s$importance[2, ], digits = 5)
pca_pr
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
s_scale <- summary(pca_human_scale)
pca_pr_scale <- round(s_scale$importance[2, ], digits = 5)
pca_pr_scale
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298

Without standardization, GNI dominates the first principal component (PC1 captures essentially 100% of the variance) because its numeric range is far larger than that of the other variables; after scaling, the variance is shared across components and PC1 explains about 54%.
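This scale effect is easy to reproduce on toy data; a sketch where one simulated variable has a much larger variance than the others:

```r
set.seed(7)
toy <- data.frame(a = rnorm(50), b = rnorm(50), big = rnorm(50, sd = 1000))
prop1 <- function(p) summary(p)$importance[2, 1]   # variance share of PC1
prop1(prcomp(toy))                 # ~1: the large-scale variable swamps PC1
prop1(prcomp(toy, scale. = TRUE))  # roughly 1/3 once all variables are scaled
```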

Tea data

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#View(tea)
head(tea)
##       breakfast     tea.time     evening     lunch     dinner     always home
## 1     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 2     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 3 Not.breakfast     tea time     evening Not.lunch     dinner Not.always home
## 4 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
## 5     breakfast Not.tea time     evening Not.lunch Not.dinner     always home
## 6 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
##       work     tearoom     friends     resto     pub       Tea   How    sugar
## 1 Not.work Not.tearoom Not.friends Not.resto Not.pub     black alone    sugar
## 2 Not.work Not.tearoom Not.friends Not.resto Not.pub     black  milk No.sugar
## 3     work Not.tearoom     friends     resto Not.pub Earl Grey alone No.sugar
## 4 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone    sugar
## 5 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## 6 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
##       how       where           price age sex          SPC         Sport age_Q
## 1 tea bag chain store       p_unknown  39   M       middle     sportsman 35-44
## 2 tea bag chain store      p_variable  45   F       middle     sportsman 45-59
## 3 tea bag chain store      p_variable  47   F other worker     sportsman 45-59
## 4 tea bag chain store      p_variable  23   M      student Not.sportsman 15-24
## 5 tea bag chain store      p_variable  48   M     employee     sportsman 45-59
## 6 tea bag chain store p_private label  21   M      student     sportsman 15-24
##   frequency     escape.exoticism     spirituality     healthy     diuretic
## 1     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
## 2     1/day     escape-exoticism Not.spirituality     healthy     diuretic
## 3    +2/day Not.escape-exoticism Not.spirituality     healthy     diuretic
## 4     1/day     escape-exoticism     spirituality     healthy Not.diuretic
## 5    +2/day     escape-exoticism     spirituality Not.healthy     diuretic
## 6     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
##       friendliness     iron.absorption     feminine     sophisticated
## 1 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 2 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 3     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 4 Not.friendliness Not.iron absorption Not.feminine     sophisticated
## 5     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 6 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
##      slimming    exciting    relaxing    effect.on.health
## 1 No.slimming No.exciting No.relaxing No.effect on health
## 2 No.slimming    exciting No.relaxing No.effect on health
## 3 No.slimming No.exciting    relaxing No.effect on health
## 4 No.slimming No.exciting    relaxing No.effect on health
## 5 No.slimming No.exciting    relaxing No.effect on health
## 6 No.slimming No.exciting    relaxing No.effect on health
#dim(tea)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")

#MCA
#install.packages('FactoMineR')
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
plot(mca, invisible=c("ind"), graph.type = "classic")


Chapter 6: Analysis of longitudinal data

date()
## [1] "Fri Dec  8 22:11:09 2023"

BPRS data set

# read BPRS (wide form) and the wrangled long form BPRSL
library(ggplot2)
library(dplyr)
library(tidyr)
BPRS <- read.table('https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt', header = T)

BPRSL <- read.csv('./data/BPRSL.txt', header = T, sep = '\t', stringsAsFactors = F)
# Draw the plot
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = scale(bprs)) %>%
  ungroup()
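The group_by + mutate(scale(...)) step standardizes bprs within each week. The same operation in base R, on a tiny made-up data frame, shows exactly what it does:

```r
# within each week: subtract the week mean, divide by the week SD
d <- data.frame(week = rep(0:1, each = 3),
                bprs = c(40, 50, 60, 30, 42, 54))      # made-up values
d$std <- ave(d$bprs, d$week,
             FUN = function(v) (v - mean(v)) / sd(v))  # z-score within week
tapply(d$std, d$week, mean)   # 0 0  (each week now has mean 0)
tapply(d$std, d$week, sd)     # 1 1  (and SD 1)
```

After this transformation, the profiles in the plot are comparable across weeks because each week's values are expressed in within-week standard deviations.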

# Plot again with the standardised bprs
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "standardized bprs")

BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")

# remove the outlier (subject mean above 60) before the t-test
BPRSL8S1 <- BPRSL8S %>%
  filter(mean < 60)
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -4.232480  7.162085
## sample estimates:
## mean in group 1 mean in group 2 
##        36.16875        34.70395
library(dplyr)
library(tidyr)
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
  mutate(baseline = BPRS$week0)

# Fit the linear model with the mean as the response 
fit <- lm(mean~baseline + treatment,  data = BPRSL8S2)

summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + treatment, data = BPRSL8S2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1729  -4.5994   0.1088   4.6703  21.0656 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.07897    4.55710   2.870  0.00675 ** 
## baseline     0.49127    0.08943   5.493 3.05e-06 ***
## treatment2  -0.58879    2.49584  -0.236  0.81480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.872 on 37 degrees of freedom
## Multiple R-squared:  0.4494, Adjusted R-squared:  0.4196 
## F-statistic:  15.1 on 2 and 37 DF,  p-value: 1.605e-05

RATS data set

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight")  %>%  mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)


# Plot the RATSL data
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line()

# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
##    Data: RATSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   1333.2   1352.2   -660.6   1321.2      170 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5386 -0.5581 -0.0494  0.5693  3.0990 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 1085.92  32.953  
##  Residual               66.44   8.151  
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 244.06890   11.73107   20.80
## Time          0.58568    0.03158   18.54
## Group2      220.98864   20.23577   10.92
## Group3      262.07955   20.23577   12.95
## 
## Correlation of Fixed Effects:
##        (Intr) Time   Group2
## Time   -0.090              
## Group2 -0.575  0.000       
## Group3 -0.575  0.000  0.333
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
##    Data: RATSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   1194.2   1219.6   -589.1   1178.2      168 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2259 -0.4322  0.0555  0.5637  2.8825 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  ID       (Intercept) 1139.2004 33.7520       
##           Time           0.1122  0.3349  -0.22
##  Residual               19.7479  4.4439       
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 246.46821   11.80888  20.871
## Time          0.58568    0.08548   6.852
## Group2      214.55803   20.16986  10.638
## Group3      258.91288   20.16986  12.837
## 
## Correlation of Fixed Effects:
##        (Intr) Time   Group2
## Time   -0.166              
## Group2 -0.569  0.000       
## Group3 -0.569  0.000  0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## RATS_ref     6 1333.2 1352.2 -660.58   1321.2                         
## RATS_ref1    8 1194.2 1219.6 -589.11   1178.2 142.94  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Observed weight (grams)") +
  theme(legend.position = "top")


(more chapters to be added similarly as we proceed with the course!)